Welcome to the Applied Data Science Workshop

First Problem

One of the major problems facing social networks is the creation of fake profiles, which enables everything from superficially inflated follower counts to influence peddling and identity theft. Instagram is one of the social networks with the most fake accounts. The goal here is to identify the characteristics of an Instagram user, and the probability, that lead to the account being classified as 'real' or as 'spammer'.

More information about the data is available at: https://www.kaggle.com/free4ever1/instagram-fake-spammer-genuine-accounts

Importing the Data

First, we import the data and generate a description to understand the variables it contains. To import the data we use Pandas (installed by default), and for the data analysis we use Pandas-Profiling (renamed ydata-profiling in later releases).
A new package can be installed with either of the following commands:

!conda install pandas-profiling -y
!pip install pandas-profiling[notebook]
In [ ]:
 
Collecting graphviz
  Downloading graphviz-0.14-py2.py3-none-any.whl (18 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.14
In [ ]:
import pandas as pd
import pandas_profiling

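# Load the training and test sets from the CSV files.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Build an exploratory report covering every column of the training set.
pandas_profiling.ProfileReport(train)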



Out[ ]:

Splitting the Data

It is important to always have two datasets: one for training and one for testing. It is also important to split the data into the variables we want to learn from, X, and the response we want to predict, Y. This is done for both datasets.

In [ ]:
# 'fake' is the label column; every other column is a feature.
X_train, Y_train = train.drop(['fake'], axis=1), train['fake']
X_test, Y_test = test.drop(['fake'], axis=1), test['fake']
display(X_train.head())
   profile pic  nums/length username  fullname words  nums/length fullname  name==username  description length  external URL  private  #posts  #followers  #follows
0            1                  0.27               0                   0.0               0                  53             0        0      32        1000       955
1            1                  0.00               2                   0.0               0                  44             0        0     286        2740       533
2            1                  0.10               2                   0.0               0                   0             0        1      13         159        98
3            1                  0.00               1                   0.0               0                  82             0        0     679         414       651
4            1                  0.00               2                   0.0               0                   0             0        1       6         151       126

Model Training

We will use the scikit-learn library, which supports training models that range from simple ones for regression, clustering, or classification tasks to more complex ones such as Support Vector Machines, Random Forests, and small neural networks with one or two layers.
For more information about this library, including tutorials and documentation, visit the official page. It can be installed with the following command:

!pip install scikit-learn
In [ ]:
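from sklearn import tree

# Decision tree with default settings (no depth limit).
clf = tree.DecisionTreeClassifier()

# Fit the tree on the training features and labels.
clf.fit(X_train, Y_train)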
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [ ]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None, 
                      feature_names=list(X_train.columns),  
                      class_names=['real','fake'],  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
Out[ ]:
[Figure: decision tree fitted without a depth limit, rendered with graphviz. 101 nodes (numbered 0 to 100). Root split: #followers ≤ 98.0 (gini = 0.5, samples = 576, value = [288, 288], class = real). Left child: nums/length username ≤ 0.195 (gini = 0.147, samples = 250, value = [20, 230], class = fake). Right child: nums/length username ≤ 0.245 (gini = 0.293, samples = 326, value = [268, 58], class = real).]
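
Each node of the tree reports a gini value next to its class counts. As a minimal sketch of where those numbers come from, the function below computes the Gini impurity, 1 minus the sum of squared class proportions, from a node's "value = [real, fake]" counts. The function name gini_impurity is an illustrative choice and is not part of scikit-learn; the counts are copied from the root and its two children in the tree above.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions p_k."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Class counts taken from the "value = [...]" entries of the tree output.
root = gini_impurity([288, 288])         # root node, perfectly balanced
left_child = gini_impurity([20, 230])    # accounts with #followers <= 98.0
right_child = gini_impurity([268, 58])   # accounts with #followers > 98.0

print(round(root, 3), round(left_child, 3), round(right_child, 3))
# prints: 0.5 0.147 0.293
```

The three results agree with the gini values printed for those nodes, and a leaf that contains a single class has impurity 0.0.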
In [ ]:
from sklearn import metrics

report_train = metrics.classification_report(Y_train, clf.predict(X_train))
report_test = metrics.classification_report(Y_test, clf.predict(X_test))

print("Train Report \n", report_train)
print("Test Report \n", report_test)
Train Report 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       288
           1       1.00      1.00      1.00       288

    accuracy                           1.00       576
   macro avg       1.00      1.00      1.00       576
weighted avg       1.00      1.00      1.00       576

Test Report 
               precision    recall  f1-score   support

           0       0.90      0.92      0.91        60
           1       0.92      0.90      0.91        60

    accuracy                           0.91       120
   macro avg       0.91      0.91      0.91       120
weighted avg       0.91      0.91      0.91       120
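
The precision, recall, and f1-score columns in these reports can be reproduced by hand from the four cells of a confusion matrix. The sketch below is one way to do that in plain Python. The counts are an assumption: the notebook never prints the confusion matrix, so they were chosen only because they are consistent with the rounded test report above (60 real and 60 fake accounts), with class 1 (fake) treated as the positive class. The function name binary_metrics is hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy


# Assumed counts (not shown in the notebook), positive class = 1 (fake):
# 55 real accounts kept as real, 5 real flagged as fake,
# 6 fake accounts missed, 54 fake accounts caught.
precision, recall, f1, accuracy = binary_metrics(tn=55, fp=5, fn=6, tp=54)

print("precision", round(precision, 2))
print("recall   ", round(recall, 2))
print("f1-score ", round(f1, 2))
print("accuracy ", round(accuracy, 2))
```

With these counts the rounded values match the row for class 1 in the test report (0.92, 0.90, 0.91) and the overall accuracy of 0.91.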

In [ ]:
clf_lim = tree.DecisionTreeClassifier(max_depth=5)
clf_lim.fit(X_train, Y_train)

dot_data = tree.export_graphviz(clf_lim, out_file=None, 
                      feature_names=list(X_train.columns),  
                      class_names=['real','fake'],  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
Out[ ]:
[Figure: decision tree limited to max_depth=5, rendered with graphviz. 39 nodes (numbered 0 to 38). Root split: #followers ≤ 98.0 (gini = 0.5, samples = 576, value = [288, 288], class = real). Left child: nums/length username ≤ 0.195 (gini = 0.147, samples = 250, value = [20, 230], class = fake). Right child: nums/length username ≤ 0.245 (gini = 0.293, samples = 326, value = [268, 58], class = real).]
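
To see what max_depth changes without reading the whole diagram, the fitted estimators can be asked for their depth and number of leaves. The sketch below does this on synthetic data from make_classification, which is an assumption used only to keep the example self-contained; it is not the Instagram dataset, so the printed numbers will differ from the trees above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 samples, 8 numeric features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Same two configurations as in the notebook: no limit vs. max_depth=5.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
limited = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

for name, model in [("unrestricted", full), ("max_depth=5", limited)]:
    print(
        name,
        "depth:", model.get_depth(),
        "leaves:", model.get_n_leaves(),
        "train accuracy:", round(model.score(X, y), 3),
    )
```

An unrestricted tree keeps splitting until its leaves are pure, which is why its training accuracy reaches 1.0; the depth limit trades some of that training fit for a smaller tree.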
In [ ]:
report_train = metrics.classification_report(Y_train, clf_lim.predict(X_train))
report_test = metrics.classification_report(Y_test, clf_lim.predict(X_test))

print("Train Report \n", report_train)
print("Test Report \n", report_test)
Train Report 
               precision    recall  f1-score   support

           0       0.95      0.97      0.96       288
           1       0.97      0.95      0.96       288

    accuracy                           0.96       576
   macro avg       0.96      0.96      0.96       576
weighted avg       0.96      0.96      0.96       576

Test Report 
               precision    recall  f1-score   support

           0       0.82      0.93      0.87        60
           1       0.92      0.80      0.86        60

    accuracy                           0.87       120
   macro avg       0.87      0.87      0.87       120
weighted avg       0.87      0.87      0.87       120
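
The problem statement asks for both the characteristics of an account and the probability of each class, while the reports above only score hard predictions. As a hedged sketch, a fitted scikit-learn tree exposes both through predict_proba and feature_importances_. The eight rows below are hypothetical toy data with two of the notebook's column names, used so the example runs on its own; with the real data, clf or clf_lim and X_test would take their place.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows: [#followers, profile pic]; labels 0 = real, 1 = fake.
feature_names = ["#followers", "profile pic"]
X_toy = [[1000, 1], [2740, 1], [414, 1], [500, 1],
         [15, 0], [40, 0], [60, 1], [25, 0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

clf_toy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in clf_toy.classes_.
# For a decision tree, each value is the class fraction in the reached leaf.
proba = clf_toy.predict_proba([[30, 0], [800, 1]])
for row in proba:
    print("P(real) =", row[0], " P(fake) =", row[1])

# feature_importances_ sums to 1; higher means the feature contributed more
# impurity reduction across the splits of the tree.
ranked = sorted(
    zip(feature_names, clf_toy.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(name, round(float(importance), 3))
```

A fully grown tree tends to return probabilities of exactly 0 or 1 because its leaves are pure, so a depth-limited tree such as clf_lim usually gives more graded estimates.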
